The two YouTube channels I chose were ones that constantly popped up in my recommendations, because I watch their videos frequently. The content on the two channels is quite different, with Vox’s videos being infographics and education and The Dodo’s being wholesome animal moments, so I expected their data to differ a lot, particularly in engagement and video length. However, I quickly realised that they have many similarities, so I wanted to compare the lengths of different aspects of their videos, including title length and video duration.
I decided to focus on these variables because I thought it would be interesting to explore these lengths, and I also wanted to use a variety of plots. I therefore used geom_jitter, geom_boxplot, and geom_col to create my graphs. I used geom_jitter to plot the number of characters in each video title, as there were too many different values to show in a simple bar graph. I used geom_col for the average duration of videos from each channel, as this was more straightforward with only two values. Lastly, I used geom_boxplot to plot the range of video durations. I was originally going to make a boxplot of engagement data such as the number of comments, but one Vox video was such an extreme outlier that it made the plot unreadable.
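One possible way around the outlier problem described above is to draw the boxplot on a log scale, so that a single extreme video no longer squashes the rest of the data. The following is only a minimal sketch and not part of my actual analysis: the data frame `toy_comments`, the column `comment_count`, and every value in it are made up for illustration.

```r
library(ggplot2)

# Toy data standing in for the real spreadsheet. The column name
# `comment_count` and all values are hypothetical.
toy_comments <- data.frame(
  channel_names = rep(c("vox", "thedodo"), each = 6),
  comment_count = c(120, 340, 560, 800, 1500, 250000,  # one extreme Vox value
                    90, 150, 210, 400, 650, 900)
)

# A log10 x-axis compresses the extreme value so both boxes stay readable
comments_plot <- ggplot(toy_comments) +
  geom_boxplot(aes(x = comment_count, y = channel_names),
               fill = "#fcbcb8") +
  scale_x_log10() +
  labs(title = "Number of comments per video",
       x = "Number of comments (log10 scale)",
       y = "Channel name")
```

A log scale cannot show videos with zero comments (ggplot2 drops them with a warning), so this would only suit data where every video has at least one comment.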
I tried to make this creative by building my own theme and applying it to all the graphs. I created the theme with theme(), setting the plot and panel background colours and adding margins. I felt this gave my graphs a much better look than only using arguments such as colour =, and it made the colour scheme more cohesive.
The most important idea I learnt from this module is how to create visualisations that look good and cohesive while still conveying the data accurately. I loved learning how to change the aesthetics of my graphs by creating and applying a theme, and how to use margins to make the graphs feel less crowded. I am also really excited that I can now create all sorts of different graphs, such as jitter plots and boxplots, which I previously couldn’t do. Next, I am excited to learn more about creating nice-looking reports and making the most of CSS within HTML reports.
library(tidyverse)
youtube_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vS2ReO1eYhuD2XkSGsQ0Tr2OpPpvU4R6aVB-b-Il0w3Ou5hQADva-ciJIx2P8YLcXhyecDPwy2sB-cs/pub?output=csv")
# colours and themes
my_colours <- c("#c7eae4", "#a7e8bd", "#fcbcb8", "#efa7a7", "#ffd972")
my_theme <- theme(plot.background = element_rect(fill = my_colours[5]),
                  panel.background = element_rect(fill = my_colours[5]),
                  plot.margin = margin(1, 1, 1, 1, "cm"))
# labelling each video with its channel and counting title characters
channel_data <- youtube_data %>%
  mutate(channel_names = case_when(
    str_detect(channelName, "@Vox") ~ "vox",
    str_detect(channelName, "@TheDodo") ~ "thedodo"
  ))
# nchar() is vectorised, so no grouping is needed here
title_nchar <- channel_data %>%
  mutate(title_chars = nchar(title))
# making jitter plot
plot_1 <- ggplot() +
  geom_jitter(data = title_nchar,
              aes(x = title_chars,
                  y = channel_names),
              height = 0.3,
              colour = "#efa7a7") +
  labs(title = "Number of characters in video titles",
       x = "Number of characters",
       y = "Channel name") +
  my_theme
ggsave("plot1.png", plot = plot_1, width = 6, height = 4, units = "in")
# finding mean duration per channel
channel_diffs <- channel_data %>%
  mutate(vox_dodo = ifelse(str_detect(str_to_lower(channel_names),
                                      "vox"),
                           "Vox",
                           "Dodo"))
channel_diffs <- channel_diffs %>%
  group_by(vox_dodo) %>%
  summarise(mean_duration = mean(duration))
plot_2 <- channel_diffs %>%
  ggplot() +
  geom_col(aes(x = vox_dodo,
               y = mean_duration),
           fill = "#a7e8bd") +
  labs(title = "Average length of videos",
       x = "Channel",
       y = "Mean Duration in seconds") +
  my_theme
ggsave("plot2.png", plot = plot_2, width = 6, height = 4, units = "in")
# boxplot of video durations
plot_3 <- title_nchar %>%
  ggplot() +
  geom_boxplot(aes(x = duration,
                   y = channel_names),
               fill = "#fcbcb8") +
  labs(title = "Range of video duration per channel",
       x = "Duration in seconds",
       y = "Channel Names") +
  my_theme
ggsave("plot3.png", plot = plot_3, width = 6, height = 4, units = "in")
# creating frames
library(magick)
plot_1 <- image_read("plot1.png") %>%
  image_scale(700)
plot_2 <- image_read("plot2.png") %>%
  image_scale(700)
plot_3 <- image_read("plot3.png") %>%
  image_scale(700)
# frame 1 creation
image_colour <- image_blank(width = 1200,
                            height = 400,
                            color = my_colours[5])
text_1 <- "Lengths of different whats?" %>%
  str_wrap(20)
frame_1 <- image_colour %>%
  image_annotate(text_1, size = 50, gravity = "center", font = "Avenir Next Condensed") %>%
  image_scale(1400)
frame_1
# frame 2
image_colour2 <- image_blank(width = 600,
                             height = 400,
                             color = my_colours[1])
text_2 <- "I wanted to compare the lengths of video titles from Vox and The Dodo, as I found that The Dodo often had much longer titles than Vox, and that most of them were also in Spanish, which I had not realised before." %>%
  str_wrap(35)
square_1 <- image_colour2 %>%
  image_annotate(text_2, size = 30, gravity = "center", font = "Avenir Next Condensed") %>%
  image_scale(700)
row_2 <- c(plot_1, square_1)
frame_2 <- image_append(row_2)
frame_2
# frame 3
text_3 <- "The next 'length' I wanted to explore was the average duration of videos published by both channels. At first glance they seemed very similar, but through exploration I quickly realised that the channels have very different average lengths." %>%
  str_wrap(35)
square_2 <- image_colour2 %>%
  image_annotate(text_3, size = 30, gravity = "center", font = "Avenir Next Condensed") %>%
  image_scale(700)
row_3 <- c(plot_2, square_2)
frame_3 <- image_append(row_3)
frame_3
# frame 4
text_4 <- "The last 'length' that was of interest to me was the range of durations of videos from the channels. I wanted to see this data in the form of boxplots, and it was interesting to see that there was a single outlier that skewed the whole graph." %>%
  str_wrap(35)
square_3 <- image_colour2 %>%
  image_annotate(text_4, size = 30, gravity = "center", font = "Avenir Next Condensed") %>%
  image_scale(700)
row_4 <- c(plot_3, square_3)
frame_4 <- image_append(row_4)
frame_4
# frame 5
image_colour <- image_blank(width = 1200,
                            height = 400,
                            color = my_colours[5])
text_5 <- "Overall, I have learnt that data from different YouTube channels and videos can vary a lot. Even within one channel, data such as durations can fluctuate massively depending on the type of content in the video. There also appears to be a relationship between the language a video is published in and the number of characters in its title." %>%
  str_wrap(65)
frame_5 <- image_colour %>%
  image_annotate(text_5, size = 35, gravity = "center", font = "Avenir Next Condensed") %>%
  image_scale(1400)
frame_5
# making gif
frames <- c(rep(frame_1, 8), rep(frame_2, 8), rep(frame_3, 8), rep(frame_4, 8), rep(frame_5, 8))
data_story <- image_animate(frames, fps = 1)
data_story
image_write(data_story, "data_story.gif")